Module 12 of 13 · 📖 4 min read · ⏱ 30 min total

FI-DPA 13 Machine Learning: Fundamentals and Algorithms (EN)

Table of Contents (5 sections)
  1. Concepts and Background
  2. Practical Steps
  3. Common Pitfalls
  4. Further Resources
  5. Knowledge Check

FI-DPA 13 Machine Learning — Fundamentals and Algorithms

Module 13 covers the fundamental concepts of machine learning, including the distinction between supervised and unsupervised learning. You will learn the most important algorithms for regression, classification, and clustering, as well as the bias-variance dilemma and principal component analysis (PCA).

The practical application of these concepts is demonstrated using typical algorithms such as Decision Tree, Random Forest, k-NN, k-Means, and PCA. Upon completion of this module, you will be able to evaluate and apply appropriate ML methods for given problem statements.

Concepts and Background

Supervised Learning
Supervised learning uses labeled training data, in which each input is paired with its correct output. The goal is to learn a function that correctly predicts the output for new, unseen inputs. The two main task types are classification and regression.
Unsupervised Learning
Unsupervised learning works with unlabeled data and independently seeks hidden patterns or structures in the data. Typical applications are clustering and dimensionality reduction.
Regression
Regression is a form of supervised learning where the goal is to predict a continuous value. Examples include predicting prices or temperatures.
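As a minimal sketch of regression, the snippet below fits a straight line to a handful of points using ordinary least squares in plain Python. The data (floor area versus price) and the helper name fit_line are made up for illustration and are not part of the module.

```python
# Hypothetical toy data: floor area in square metres and price in thousands.
areas = [50.0, 70.0, 90.0, 110.0]
prices = [150.0, 210.0, 270.0, 330.0]


def fit_line(xs, ys):
    """Ordinary least squares for a single feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel.
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(areas, prices)

# Predict a continuous value for an input that was not in the training data.
predicted_price = slope * 100.0 + intercept
print(f"slope={slope:.2f}, intercept={intercept:.2f}, prediction={predicted_price:.1f}")
```

The output is a number on a continuous scale, which is what distinguishes regression from classification.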
Classification
Classification is also a form of supervised learning, in which each input is assigned to one of a set of predefined categories. Examples include spam email detection or disease diagnosis.
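To make the idea concrete, here is a small sketch of a k-nearest-neighbours (k-NN) classifier in plain Python. The two features (number of links and number of exclamation marks per email), the data, and the function name knn_predict are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

# Hypothetical labeled training data: (links, exclamation marks) per email.
train_points = [(0, 1), (1, 0), (1, 1), (8, 9), (9, 8), (9, 9)]
train_labels = ["ham", "ham", "ham", "spam", "spam", "spam"]


def knn_predict(points, labels, query, k=3):
    """Return the majority label among the k training points closest to query."""
    # Pair each training point's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(point, query), label) for point, label in zip(points, labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


print(knn_predict(train_points, train_labels, (8, 8)))
print(knn_predict(train_points, train_labels, (0, 0)))
```

Unlike regression, the prediction is one of a fixed set of labels. Because the labels of the training points drive the decision, this is a supervised method.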
Clustering
Clustering is a method of unsupervised learning in which similar data points are grouped into clusters. The goal is to discover structure in the data without relying on predefined labels.
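The following sketch shows the basic loop behind k-Means in plain Python: assign every point to its nearest centroid, then move each centroid to the mean of its points. The data, the fixed starting centroids, and the function name k_means are assumptions made for illustration; library implementations such as scikit-learn's KMeans choose starting centroids differently and add convergence checks.

```python
import math

# Hypothetical unlabeled 2-D points forming two visually separate groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]


def k_means(data, centroids, iterations=10):
    """Run a fixed number of assign/update rounds and return (centroids, clusters)."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in data:
            nearest = min(
                range(len(centroids)), key=lambda i: math.dist(point, centroids[i])
            )
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its assigned points.
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(coords) / len(cluster) for coords in zip(*cluster))
            if cluster
            else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters


# Fixed starting centroids keep this sketch deterministic.
final_centroids, final_clusters = k_means(points, [(0.0, 0.0), (10.0, 10.0)])
print(final_centroids)
print(final_clusters)
```

No labels are passed in at any point; the grouping emerges from the distances between the points alone.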

Practical Steps

  1. Prepare data: Load your dataset from a suitable format (e.g., CSV), then handle missing values and encode categorical variables. Proper data preprocessing is crucial for model quality.
  2. Split data into training and test sets: Use the train_test_split function from scikit-learn to divide your data into training and test datasets. Evaluating on data the model never saw during training gives a more honest estimate of its performance.
  3. Select and initialize model: Choose an appropriate algorithm for your problem (e.g., RandomForestClassifier for classification) and initialize the model with suitable parameters. The choice of the right algorithm depends heavily on the nature of your data and the problem.
  4. Train model: Fit the model to your training data by calling the fit method. During this process, the model learns the underlying patterns in the data.
  5. Evaluate model: Use metrics such as accuracy, precision, or F1-score to evaluate the model's performance on the test set. This provides insight into the model's generalization ability.
  6. Optimize model: Use techniques like GridSearchCV to optimize the model's hyperparameters. Careful hyperparameter optimization can significantly improve model performance.
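The steps above can be sketched end to end with scikit-learn. This is one possible arrangement, not a prescribed solution. It uses the bundled Iris dataset as a stand-in for your own CSV file (an assumption; Iris has no missing values or categorical features, so step 1 reduces to loading). The split ratio, random seeds, and parameter grid are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Step 1: load the data (stand-in dataset; replace with your own preprocessing).
X, y = load_iris(return_X_y=True)

# Step 2: hold out 25% of the rows for testing; stratify keeps class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Steps 3 and 4: initialize the model and fit it to the training data only.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Step 5: evaluate on the held-out test set.
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
macro_f1 = f1_score(y_test, y_pred, average="macro")
print(f"accuracy={accuracy:.3f}, macro F1={macro_f1:.3f}")

# Step 6: search a small hyperparameter grid with 5-fold cross-validation
# on the training data, so the test set stays untouched during tuning.
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 3]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

# Evaluate the best configuration once on the test set.
tuned_accuracy = accuracy_score(y_test, search.best_estimator_.predict(X_test))
print(f"best params={search.best_params_}, tuned accuracy={tuned_accuracy:.3f}")
```

Running the grid search on the training portion only keeps the test set as an independent check; tuning against the test set would make the reported score overly optimistic.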

Common Pitfalls

Further Resources

Knowledge Check

Four questions for self-assessment.

What is the main difference between supervised and unsupervised learning?
  • A) Supervised learning always uses neural networks, unsupervised learning does not
  • B) Supervised learning requires labeled data, unsupervised learning works with unlabeled data
  • C) Supervised learning is always more accurate than unsupervised learning
  • D) Supervised learning can only work with numerical data, unsupervised learning can also work with categorical data

Correct Answer: B. The key difference is that supervised learning trains on labeled data, while unsupervised learning works without predefined labels. Option A is incorrect because both paradigms cover a variety of algorithms. Option C is not generally valid because accuracy depends on the problem. Option D is incorrect because both paradigms can work with different data types.

Which category of machine learning does predicting house prices based on features like size, location, and year of construction belong to?
  • A) Classification
  • B) Clustering
  • C) Regression
  • D) Principal Component Analysis

Correct Answer: C. Regression is the prediction of continuous values such as prices. Classification is incorrect because it assigns inputs to discrete categories. Clustering is an unsupervised method, and PCA is used for dimensionality reduction, not prediction.

What problem arises when a machine learning model is too closely fitted to the training data?
  • A) Underfitting
  • B) Overfitting
  • C) The Bias-Variance Dilemma
  • D) The Problem of High Dimensionality

Correct Answer: B.